Abstract


We  present  a  framework  for  categorical  shape  recognition.  The  coarse  shape  of  an 
object  is  captured  by  a  multiscale  blob  decomposition,  representing  the  compact 
and  elongated  parts  of  an  object  at  appropriate  scales.  These  parts,  in  turn,  map 
to nodes in a directed acyclic graph, in which edges encode both semantic relations
(parent/child or sibling) and geometric relations. Given two image descriptions,
each represented as a directed acyclic graph, we draw on spectral graph theory
to  derive  a  new  algorithm  for  computing  node  correspondence  in  the  presence  of 
noise  and  occlusion.  In  computing  correspondence,  the  similarity  of  two  nodes  is  a 
function  of  their  topological  (graph)  contexts,  their  geometric  (relational)  contexts, 
and  their  node  contents.  We  demonstrate  the  approach  on  the  domain  of  view-based 
3-D  object  recognition. 

Key  words:  generic  object  recognition,  shape  categorization,  graph  matching, 
scale  spaces,  spectral  graph  theory 
1 Introduction


Object categorization has long been a goal of the object recognition community,
with notable early work by Binford [2], Marr and Nishihara [28], Agin and
Binford [1], Nevatia and Binford [31], and Brooks [4] attempting to construct
and recognize prototypical object models based on their coarse shape structure.
However, the representational gap between low-level image features and
the abstract nature of the models was large, and the community lacked the
computational infrastructure required to bridge this representational gap [18].
Instead, images were simplified and objects were textureless, so that extracted
image features could map directly to salient model features. In the years to
follow, such generic object recognition systems gave way to exemplar-based
systems, first passing through highly constrained geometric (CAD) models,
and on to today’s generation of appearance-based models. Whether the images
were moved closer to the models (earlier approaches) or the models moved
closer to the images (later approaches), the representational gap has remained
largely unaddressed.

The community is now returning to the problem of object categorization, using
powerful machine learning techniques and new appearance-based features.
However, appearance-based representations are not invariant to significant
within-class appearance change, due to texture, surface markings, structural
detail, or articulation. As a result, the categories are often very restricted,
such as faces, cars, motorcycles, and specific species of animals. Accommodating
significant within-class shape variation puts significant demands on a
representation: 1) it must capture the coarse structure of an object; and 2) it
must be local, in order to accommodate occlusion, clutter and noise. These
criteria point to a structured shape description not unlike the ambitious part-based
recognition frameworks proposed in the 1970s and 1980s. However, the
representational gap facing these early systems still poses a major obstacle.

We  begin  by  imposing  a  strong  shape  prior  on  the  parts  and  relations  making 
up  an  object.  Specifically,  we  represent  a  2-D  object  view  as  a  multiscale 
blob  decomposition,  whose  part  vocabulary  includes  two  types,  compact  blobs 
and  elongated  ridges,  and  whose  relations  also  include  two  types,  semantic 
(parent/child and sibling) and geometric. The detected qualitative parts
and their relations are captured in a directed acyclic graph, called a blob graph,
in which nodes represent parts and edges capture relations.

Choosing  a  restricted  vocabulary  of  parts  helps  us  bridge  the  representational 
gap. Early generation systems started with low-level features such as edges, regions,
and interest points, and were faced with the challenging task of grouping
and abstracting them to form high-level parts, such as generalized cylinders.
By severely restricting the part vocabulary, we construct a high-level part detector
from simple, multiscale filter responses. Although the parts are simple
and qualitative, their semantic and geometric relations add rich structure to
the representation, yielding a shape representation that can be used to discriminate
shape categories.

Any  shortcut  to  extracting  high-level  part  structure  from  low-level  features, 
such  as  filter  responses,  will  be  prone  to  error.  Thus,  the  recovered  blob  graph 
will  be  missing  nodes/edges  and  will  contain  spurious  nodes/edges,  setting 
up a challenging inexact graph matching problem. In our previous work on
matching rooted trees [42], we drew on spectral graph theory to represent the
coarse “shape” of a tree as a low-dimensional vector based on the eigenvalues of
the tree’s symmetric adjacency matrix. Our matching algorithm utilized these
vectors in a coarse-to-fine matching strategy that found corresponding nodes.
Although  successfully  applied  to  shock  trees,  the  matching  algorithm  suffered 
from  a  number  of  limitations:  1)  it  could  not  handle  the  directed  acyclic  graph 
structure  found  in,  for  example,  our  multiscale  blob  graphs,  e.g.,  a  blob  may 
have  two  parents;  2)  it  was  restricted  to  matching  hierarchical  parent/child 
relations,  and  could  not  accommodate  sibling  relations,  e.g.,  the  geometric 
relations  between  blobs  at  a  given  scale;  and  3)  the  matching  algorithm  was 
an  approximation  algorithm  that  could  not  ensure  that  hierarchical  and  sibling 
relations  were  preserved  during  matching,  allowing,  for  example,  two  siblings 
(sharing a parent) in one tree to match two nodes in the other tree having an
ancestor/descendant relationship.

We  first  extend  our  matching  algorithm  to  handle  directed  acyclic  graphs, 
drawing  on  our  recent  work  in  indexing  hierarchical  structures  [40],  in  which 
we  represent  the  coarse  shape  of  a  directed  acyclic  graph  as  a  low-dimensional 
vector  based  on  the  eigenvalues  of  the  DAG’s  antisymmetric  adjacency  matrix. 
Next, we introduce a notion of graph neighborhood context, allowing us to accommodate
sibling relations, such as our blob graphs’ geometric relations, in
our matching algorithm. Like our vector encoding of hierarchical structure,
local sibling structure is also encoded in a low-dimensional vector. Finally,
we extend the matching algorithm to ensure that both hierarchical and
sibling relations are enforced during matching, yielding improved correspondence
in the presence of noise and occlusion. The result is a far more powerful
matching framework that is ideally suited to our multiscale blob graphs.

Following  a  review  of  related  work  in  Section  2,  we  describe  our  qualitative 
shape representation in Section 3. Next, we describe our new matching algorithm
in Section 4. In Section 5, we evaluate the approach on the domain of
view-based  3-D  object  recognition.  We  discuss  the  limitations  of  the  approach 
in  Section  6,  and  draw  conclusions  in  Section  7. 


2  Related  Work 

There  has  been  considerable  effort  devoted  to  both  scale  space  theory  and 
hierarchical  structure  matching,  although  much  less  effort  has  been  devoted  to 
combining  the  two  paradigms.  Coarse-to-fine  image  descriptions  are  plentiful, 
including  work  by  Burt  [5],  Lindeberg  [26],  Simoncelli  et  al.  [43],  Mallat  and 
Hwang [27], and Witkin and Tenenbaum [?]. Some have applied such models
to directing visual attention, e.g., Tsotsos [46], Jagersand [17], Olshausen et
al. [32], and Takacs and Wechsler [44]. Although such descriptions are well
suited  for  tasks  such  as  compression,  attention,  or  object  localization,  they 
often  lose  the  detailed  shape  information  required  for  object  recognition. 

Others  have  developed  multi-scale  image  descriptions  specifically  for  matching 
and  recognition.  Crowley  and  Sanderson  [8,7,9]  extracted  peaks  and  ridges 
from  a  Laplacian  pyramid  and  linked  them  together  to  form  a  tree  structure. 
However,  during  the  matching  phase,  little  of  the  trees’  topology  or  geometry 
was  exploited  to  compute  correspondence.  Rao  et  al.  [35]  correlate  a  rigid, 
multiscale  saliency  map  of  the  target  object  with  the  image.  However,  like 
any  template-based  approach,  the  technique  is  rather  global,  offering  little 
invariance to occlusion or object deformation. In an effort to accommodate
object deformation, Wiskott et al. apply elastic graph matching to a planar
graph whose nodes are wavelet jets. Although their features are multi-scale,
their representation is not hierarchical, and matching requires that the graphs
be coarsely aligned in scale and image rotation [48]. A similar approach was
applied  to  hand  posture  recognition  by  Triesch  and  von  der  Malsburg  [45]. 

The  representation  of  image  features  at  multiple  scales  suggests  a  hierarchical 
graph  representation,  which  can  accommodate  feature  attributes  in  the  nodes 
and  relational  attributes  in  the  arcs.  Although  graph  matching  is  a  popular 
topic  in  the  computer  vision  literature  [12],  including  both  inexact  and  exact 
graph  matching  algorithms,  there  is  far  less  work  on  dealing  with  the  matching 
of  hierarchical  graphs,  i.e.,  DAGs,  in  which  lower  (deeper)  levels  reflect  less 
saliency.  Pelillo  et  al.  [34]  provided  a  solution  for  the  matching  of  hierarchical 
trees  by  constructing  an  association  graph  using  the  concept  of  connectivity 
and solving a maximal clique problem in this new structure. The latter problem
can be formulated as a quadratic optimization problem, which they solved using
the so-called replicator dynamical systems developed in theoretical biology.
Pelillo [33] also generalized the framework to match free trees,
using simple payoff-monotonic dynamics from evolutionary game theory. The
problem  of  matching  hierarchical  trees  has  also  been  studied  in  the  context 
of  edit-distance  (see,  e.g.,  [38]).  In  such  a  setting,  one  seeks  a  minimal  set  of 
re-labellings,  additions,  deletions,  merges,  and  splits  of  nodes  and  edges  that 
transform  one  graph  into  another. 

In  recent  work  [10,11],  we  presented  a  framework  for  many-to-many  matching 
of  hierarchical  structures,  where  features  and  their  relations  were  represented 
using  directed  edge-weighted  graphs.  The  method  began  with  transforming 
the  graph  into  a  metric  tree.  Next,  using  graph  embedding  techniques,  the 
tree  was  embedded  into  a  normed  vector  space.  This  two-step  transformation 
allowed  us  to  reduce  the  problem  of  many-to-many  graph  matching  to  a  much 
simpler  problem  of  matching  weighted  distributions  of  points  in  a  normed 
vector space. To compute the distance between two weighted distributions,
we  used  a  distribution-based  similarity  measure,  known  as  the  Earth  Mover’s 
Distance  under  transformation  [6,37]. 

As mentioned in Section 1, object categorization has received renewed attention
from the recognition community. In [14], Fergus et al. present a method
to learn and recognize object class models from unlabeled cluttered scenes
in a scale invariant manner. They deploy a probabilistic representation to simultaneously
model shape, appearance, occlusion, and relative scale. They
also use expectation maximization for learning the parameters of the scale-invariant
object model and use this model in a Bayesian manner to classify
images.  Fei-Fei  et  al.  [13]  improved  this  by  proposing  a  method  for  learning 
object  categories  from  just  a  few  training  images.  Their  proposed  method  is 
based  on  making  use  of  priors  to  construct  a  generative  probabilistic  model, 
assembled  from  object  categories  which  were  previously  learned. 

Lazebnik et al. [21] present a framework for the representation of 3-D objects
using multiple composite local affine parts. Specifically, these are 2-D configurations
of regions that are stable across a range of views of an object, and
also across multiple instances of the same object category. These composite
representations provide improved expressiveness and distinctiveness, along
with added flexibility for representing complex 3-D objects. Leibe and Schiele
[23] propose a new database for comparing different methods for object categorization.
The database contains high-resolution color images of 80 objects
from 8 different categories, for a total of 3280 images, and was used to analyze
the performance of several appearance and contour based methods. The
best reported method for categorization for this database is a combination of
both contour- and shape-based methods. This new generation of categorization
systems is primarily appearance-based, and therefore not well equipped
to handle within-class deformations due to significant appearance change, articulation,
or significant changes in minor geometric detail.

The  closest  integrated  framework  to  that  proposed  here  is  due  to  Shokoufandeh 
et  al.  [41],  who  match  multiscale  blob  representations  represented  as  directed 
acyclic  graphs.  The  multi-scale  description  in  that  work,  due  to  Marsic  [29], 
excluded  geometric  relations  among  sibling  nodes  and  did  not  include  ridge 
features. Moreover, the matching framework had to choose between two algorithms,
one targeting structural matching and the other enforcing both
structural and geometrical graph alignment. Our proposed multiscale image
representation is far richer in terms of its underlying features, and resembles
that of Bretzner and Lindeberg [3], who explored qualitative, multi-scale hierarchies
for object tracking. Our matching framework, on the other hand,
offers several orders of magnitude less complexity, handles noise more effectively,
and can handle both structural and geometrical matching within the
same framework.


3  Building  a  Qualitative  Shape  Feature  Hierarchy 

3.1  Extracting  Qualitative  Shape  Features 

As mentioned in Section 1, our qualitative feature hierarchy represents an image
in terms of a set of blobs and ridges, captured at appropriate scales. The
representation is an extension of the work presented in [3]. Blob and ridge extraction
is performed using automatic scale selection, as described in previous
work (see [25] and [24]). We also extract a third descriptor, called the windowed
second moment matrix, which describes the directional characteristics
of the underlying image structure.

Fig. 1. Feature Extraction: (a) extracted blobs and ridges at appropriate scales; (b)
extracted features before and after removing multiple responses and linking ridges.

A scale-space representation of the image signal $f$ is computed by convolution
with Gaussian kernels $g(\cdot;\, t)$ of different variance $t$, giving $L(\cdot;\, t) = g(\cdot;\, t) * f(\cdot)$.
Blob detection aims at locating compact objects or parts in the image.
The entity used to detect blobs is the square of the normalized Laplacian
operator,

$$\nabla^2_{norm} L = t\,(L_{xx} + L_{yy}). \qquad (1)$$

Blobs are detected as local maxima in scale-space. Figure 1(a) shows an image
of a hand with the extracted blobs superimposed. A blob is graphically
represented by a circle defining a support region, whose radius is proportional
to $\sqrt{t}$.

Elongated  structures  are  localized  where  the  multi-scale  ridge  detector 

$$\mathcal{R}_{norm} L = t^{3/2} \left( (L_{xx} - L_{yy})^2 + 4 L_{xy}^2 \right) \qquad (2)$$

assumes  local  maxima  in  scale-space.  Figure  1(a)  also  shows  the  extracted 
ridges,  represented  as  superimposed  ellipses,  each  defining  a  support  region, 
with width proportional to $\sqrt{t}$. To represent the spatial extent of a detected
image structure, a windowed second moment matrix

$$\Sigma = \int_{\eta \in \mathbb{R}^2} \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix} g(\eta;\, t_{int}) \, d\eta \qquad (3)$$


is computed at the detected feature position and at an integration scale $t_{int}$
proportional to the scale $t_{det}$ of the detected image feature. There are two parameters
of the directional statistics that we make use of here: the orientation
and the anisotropy, given from the eigenvalues $\lambda_1$ and $\lambda_2$ ($\lambda_1 > \lambda_2$) and their
corresponding eigenvectors $\vec{e}_{\lambda_1}$ and $\vec{e}_{\lambda_2}$ of $\Sigma$. The anisotropy is defined as

$$\tilde{Q} = \frac{1 - \lambda_2/\lambda_1}{1 + \lambda_2/\lambda_1}, \qquad (4)$$

while the orientation is given by the direction of $\vec{e}_{\lambda_1}$.
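As a minimal sketch of the two directional statistics, the following Python function computes the anisotropy and the orientation from the three distinct entries of a symmetric 2x2 second moment matrix. The function name and argument names are illustrative assumptions, and the closed-form 2x2 eigen-decomposition is a standard identity rather than anything specific to this paper.

```python
import math

def anisotropy_and_orientation(sxx, sxy, syy):
    """Return (anisotropy, orientation in radians) for the symmetric matrix
    [[sxx, sxy], [sxy, syy]], assumed positive semi-definite and non-zero."""
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mean = 0.5 * (sxx + syy)
    dev = math.hypot(0.5 * (sxx - syy), sxy)
    lam1, lam2 = mean + dev, mean - dev  # lam1 >= lam2
    ratio = lam2 / lam1
    anisotropy = (1.0 - ratio) / (1.0 + ratio)
    # Direction of the eigenvector that belongs to the larger eigenvalue.
    orientation = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return anisotropy, orientation
```

An isotropic matrix yields an anisotropy of 0, while a rank-one matrix (a perfectly elongated structure) yields 1.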

To  improve  feature  detection  in  scenes  with  poor  intensity  contrast  between 
the  image  object  and  background,  we  utilize  color  information.  This  is  done 
by  extracting  features  in  the  R,  G  and  B  color  bands  separately,  along  with 
the intensity image. Recurring features are rewarded with higher significance.
Furthermore, if we have advance information on the color of the object,
improvements  can  be  achieved  by  weighting  the  significance  of  the  features  in 
the  color  bands  differently. 

When  constructing  a  feature  hierarchy,  we  extract  the  N  most  salient  image 
features,  ranked  according  to  the  response  of  the  scale-space  descriptors  used 
in  the  feature  extraction  process.  From  these  features,  a  Feature  Map  is  built 
according  to  the  following  steps: 
3.1.1 Merging multiple feature responses


This step removes multiple overlapping responses originating from the same
image structure, the effect of which can be seen in Figure 1(a). To be able to
detect overlapping features, a measure of inter-feature similarity is needed. For
this purpose, each feature is associated with a 2-D Gaussian kernel $g(x, \Sigma)$,
where the covariance is given from the second moment matrix. When two features
are positioned near each other, their Gaussian functions will intersect.
The similarity measure between two such features is based on the disjunct
volume $D$ of the two Gaussians [20]. This volume is calculated by integrating
the square of the difference between the two Gaussian functions ($g_A$, $g_B$)
corresponding to the two intersecting features $A$ and $B$:

$$D(A, B) = \int_{\mathbb{R}^2} (g_A - g_B)^2 \, dx. \qquad (5)$$

The  disjunct  volume  depends  on  the  differences  in  position,  variance,  scale  and 
orientation  of  the  two  Gaussians,  and  for  ridges  is  more  sensitive  to  translations 
in  the  direction  perpendicular  to  the  ridge. 
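The disjunct volume can be evaluated in closed form if one assumes that both Gaussians are normalized to unit mass (an assumption made here for illustration; the normalization used in [20] may differ). Expanding the square gives three integrals of Gaussian products, each of which is itself a Gaussian evaluated at the difference of the means. The function names below are hypothetical.

```python
import math

def _det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def _overlap(mu_a, cov_a, mu_b, cov_b):
    """Integral over R^2 of the product of two unit-mass 2-D Gaussians."""
    s = [[cov_a[i][j] + cov_b[i][j] for j in range(2)] for i in range(2)]
    det = _det(s)
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # Quadratic form d^T s^{-1} d, using the explicit 2x2 inverse.
    quad = (s[1][1] * dx * dx - (s[0][1] + s[1][0]) * dx * dy
            + s[0][0] * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def disjunct_volume(mu_a, cov_a, mu_b, cov_b):
    """Integral of (g_A - g_B)^2, expanded as g_A^2 + g_B^2 - 2 g_A g_B."""
    return (_overlap(mu_a, cov_a, mu_a, cov_a)
            + _overlap(mu_b, cov_b, mu_b, cov_b)
            - 2.0 * _overlap(mu_a, cov_a, mu_b, cov_b))
```

Under this assumption, two identical features give a volume of zero, and for an elongated covariance a unit shift across the short axis produces a larger volume than the same shift along the long axis.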

3.1.2  Linking  ridges 

The  ridge  detection  will  produce  multiple  responses  on  a  ridge  structure  that  is 
long  compared  to  its  width.  These  ridges  are  linked  together  to  form  one  long 
ridge, as illustrated in Figure 1(b). Two ridges are linked when two
conditions hold: 1) they must be aligned, and 2) their support regions
must overlap. After the linking is performed, the anisotropy and support region
for the resulting ridge are re-calculated. The anisotropy is re-calculated from the
new length/width relationship as 1 − (width of structure)/(length of structure).
3.2 Assembling the Features into a Graph


Once  the  Feature  Map  is  constructed,  the  component  features  are  assembled 
into  a  directed  acyclic  graph.  Associated  with  each  node  (blob/ridge)  are  a 
number  of  attributes,  including  position,  orientation,  and  support  region.  A 
feature  at  the  coarsest  scale  of  the  Feature  Map  is  chosen  as  the  root.  Next, 
finer-scale features that overlap with the root become its children through
hierarchical edges. These children, in turn, select overlapping features (again
through hierarchical edges) at finer scales to be their children, etc. From the
unassigned features, the feature at the coarsest scale is chosen as a new root.
Children  of  this  root  are  selected  from  unassigned  as  well  as  assigned  features, 
and  the  process  is  repeated  until  all  features  are  assigned  to  a  graph.  A  child 
node  can  therefore  have  multiple  parents.  To  yield  a  single  rooted  graph,  which 
is  needed  in  the  matching  step,  a  virtual  top  root  node  is  inserted  as  the  parent 
of  all  root  nodes  in  the  image. 
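The assembly procedure can be sketched as follows. This is an illustrative reading of the steps above, not the authors' implementation: features are assumed to be dictionaries with a position (`x`, `y`), a circular support radius `r`, and a scale `t`, and two features are assumed to overlap when their circles intersect. The node index -1 stands for the virtual top root.

```python
from collections import deque

def overlaps(a, b):
    """Assumed overlap test: the circular support regions intersect."""
    dx, dy = a["x"] - b["x"], a["y"] - b["y"]
    return dx * dx + dy * dy < (a["r"] + b["r"]) ** 2

def build_blob_graph(features):
    """Return (sorted hierarchical edges, roots); node -1 is the virtual root."""
    order = sorted(range(len(features)), key=lambda i: -features[i]["t"])
    assigned, edges, roots = set(), set(), []
    for root in order:                       # coarsest unassigned feature first
        if root in assigned:
            continue
        roots.append(root)
        assigned.add(root)
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            for child in order:
                finer = features[child]["t"] < features[parent]["t"]
                if finer and overlaps(features[parent], features[child]):
                    edges.add((parent, child))   # child may already have a parent
                    if child not in assigned:    # expand each feature only once
                        assigned.add(child)
                        queue.append(child)
    edges |= {(-1, r) for r in roots}
    return sorted(edges), roots
```

Because every edge points from a coarser to a strictly finer scale, the result is acyclic, and a node can acquire several parents as the text describes.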

Associated  with  each  edge  are  a  number  of  important  geometric  attributes. 
For an edge $\mathcal{E}$, directed from a vertex $V_A$ representing feature $F_A$ to a vertex
$V_B$ representing feature $F_B$, we define the following attributes, as shown in
Figure 2:

•  Distance.  Two measures of inter-feature distance are associated with the
edge: 1) the smallest distance $d$ from the support region of $F_A$ to the support
region of $F_B$, normalized to the largest of the radii $r_A$ and $r_B$; and 2)
the distance between their centers normalized to the radius $r_A$ of $F_A$ in the
direction of the distance vector between their centers.

•  Relative orientation.  The relative orientation between $F_A$ and $F_B$.
Fig. 2. The four edge relations: (a,b) two normalized distance measures, (c) relative
orientation, and (d) bearing.


Fig.  3.  Example  graph  of  a  hand  image,  with  the  hierarchical  edges  shown  in  green. 


•  Bearing.  The bearing of a feature $F_B$, as seen from a feature $F_A$, is defined
as the angle of the distance vector $x_B - x_A$ with respect to the orientation
of $F_A$, measured counter-clockwise.

•  Scale ratio.  The scale-invariant relation between $F_A$ and $F_B$ is the ratio
between their scales $t_{F_A}$ and $t_{F_B}$.

An  example  of  a  blob  graph  for  a  hand  image,  showing  hierarchical  edges,  is 
shown  in  Figure  3. 
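The four edge relations listed above can be sketched for the simplified case of circular support regions. Everything below is an assumption made for illustration: the dictionary keys, the use of a single radius `r` for both distance measures, the reduction of relative orientation modulo π, and the direction of the scale ratio (finer over coarser) are not specified in this form by the text.

```python
import math

def edge_attributes(a, b):
    """Attributes of the edge from feature a to feature b (circular supports)."""
    dx, dy = b["x"] - a["x"], b["y"] - a["y"]
    center = math.hypot(dx, dy)
    gap = max(0.0, center - a["r"] - b["r"])   # 0 when the regions overlap
    return {
        "region_distance": gap / max(a["r"], b["r"]),
        "center_distance": center / a["r"],
        "relative_orientation": (b["theta"] - a["theta"]) % math.pi,
        # Angle of the center-to-center vector relative to a's orientation,
        # measured counter-clockwise.
        "bearing": (math.atan2(dy, dx) - a["theta"]) % (2.0 * math.pi),
        "scale_ratio": b["t"] / a["t"],
    }
```

For elliptical ridge supports, the radius in the second distance measure would have to be taken along the direction of the distance vector, as the text specifies.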
4 Matching Problem Formulation


Given two images and their corresponding feature map graphs, $G_1 = (V_1, E_1)$
and $G_2 = (V_2, E_2)$, with $|V_1| = n_1$ and $|V_2| = n_2$, we seek a method for computing
their similarity. In the absence of noise, segmentation errors, occlusion,
and clutter, computing the similarity of $G_1$ and $G_2$ could be formulated as a
label-consistent graph isomorphism problem. However, under normal imaging
conditions, there may not exist significant subgraphs common to $G_1$ and $G_2$.
We therefore seek an approximate solution that captures both the structural
and geometrical similarity of $G_1$ and $G_2$ as well as corresponding node similarity.
Structural similarity is a domain-independent measure that accounts
for  similarity  in  the  “shapes”  of  two  graphs,  in  terms  of  numbers  of  nodes, 
branching  factor  distributions,  etc.  Geometrical  similarity,  on  the  other  hand, 
accounts  for  consistency  in  the  relative  positions,  orientations,  and  scales  of 
nodes  in  the  two  graphs.  In  the  following  subsections,  we  describe  these  two 
signatures  and  combine  them  in  an  efficient  algorithm  to  match  two  blob 
graphs. 


4.1 Encoding Graph Structure

As described in Section 1, our previous work on rooted tree matching drew
on the eigenvalues of a tree’s symmetric {0, 1} adjacency matrix to encode
the “shape” of a tree using a low-dimensional vector. The eigenvalues of a
graph’s adjacency matrix characterize the graph’s degree distribution, an important
structural property of the graph. In extending that approach to DAG
matching, we first draw on our recent work in indexing hierarchical (DAG)
structures [40], in which the magnitudes of the eigenvalues of a DAG’s antisymmetric
{0, 1, −1} adjacency matrix¹ are used to encode the shape of
a DAG using a low-dimensional vector. Moreover, in [40], we show that the
eigenvalues are invariant to minor structural perturbation of the graph due to
noise and occlusion.

We briefly review the construction of our graph abstraction; details can be
found in [40]. Let $\mathcal{D}$ be a DAG whose maximum branching factor is $\Delta(\mathcal{D})$,
and let the subgraphs of its root be $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_S$, as shown in Figure 4. For
each subgraph, $\mathcal{D}_i$, whose root degree is $\delta(\mathcal{D}_i)$, we compute the magnitudes of
the (complex) eigenvalues of $\mathcal{D}_i$’s submatrix, sort the magnitudes in decreasing
order, and let $S_i$ be the sum of the $\delta(\mathcal{D}_i) - 1$ largest magnitudes. The sorted $S_i$’s
become the components of a $\Delta(\mathcal{D})$-dimensional vector assigned to the DAG’s
root. If the number of $S_i$’s is less than $\Delta(\mathcal{D})$, then the vector is padded with
zeroes.  We  can  recursively  repeat  this  procedure,  assigning  a  vector  to  each 
nonterminal  node  in  the  DAG,  computed  over  the  subgraph  rooted  at  that 
node.  We  call  each  such  vector  a  topological  signature  vector ,  or  TSV.  The 
TSV  assigned  to  a  node  allows  the  structural  context  of  the  node  (i.e.,  the 
subgraph  rooted  at  the  node)  to  be  encapsulated  in  the  node  as  an  attribute. 
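The construction of a TSV can be sketched with NumPy as follows. This is an illustrative reading of the description above, not the implementation of [40]: the DAG is assumed to be given as a dictionary `children` mapping each node to its list of children, the root degree of a subgraph is assumed to be the number of children of its root, and the function names are hypothetical.

```python
import numpy as np

def descendants(children, root):
    """Nodes of the subgraph rooted at `root` (root included)."""
    seen, stack = [], [root]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.append(n)
            stack.extend(children.get(n, []))
    return seen

def eigensum(children, root):
    """Sum of the (root degree - 1) largest eigenvalue magnitudes of the
    antisymmetric {0, 1, -1} adjacency matrix of the subgraph at `root`."""
    nodes = descendants(children, root)
    index = {n: i for i, n in enumerate(nodes)}
    a = np.zeros((len(nodes), len(nodes)))
    for n in nodes:
        for c in children.get(n, []):
            a[index[n], index[c]] = 1.0    # forward edge
            a[index[c], index[n]] = -1.0   # backward edge
    mags = np.sort(np.abs(np.linalg.eigvals(a)))[::-1]
    k = max(len(children.get(root, [])) - 1, 0)
    return float(mags[:k].sum())

def tsv(children, node, max_branching):
    """Sorted eigensums of the node's child subgraphs, zero-padded."""
    sums = sorted((eigensum(children, c) for c in children.get(node, [])),
                  reverse=True)
    return sums + [0.0] * (max_branching - len(sums))
```

For a root with one child that has two leaf children and one child that is a leaf, `tsv` returns a vector whose first entry is the square root of 2 and whose second entry is 0.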

4.2 Encoding Graph Geometry

The above encoding of structure suffers from the drawback that it does not
capture the geometry of the nodes. For example, two graphs with identical
structure may differ in terms of the relative positions of their nodes, the relative
orientations of their nodes (for elongated nodes), and the relative scales of
their nodes. Just as we derived a topological signature vector, which encodes
the “neighborhood” structure of a node, we now seek an analogous “geometrical
signature vector”, which encodes the neighborhood geometry of a node.
This geometrical signature will be combined with our new topological signature
in a new algorithm that computes the distance between two directed
acyclic graphs and preserves hierarchical and sibling constraints.

¹ A matrix with 1’s (−1’s) indicating a forward (backward) edge between adjacent
nodes in the graph (and 0’s on the diagonal).

Fig. 4. Forming the structural signature: 1) compute the eigensums
$S_i = \lambda_1 + \lambda_2 + \cdots + \lambda_k$; 2) sort the eigensums, e.g.,
$S_2 \geq S_1 \geq S_3 \geq \cdots \geq S_n$; 3) form the topological signature
$TSV(r) = (S_2, S_1, S_3, \ldots, S_n, 0)$.

Let $G = (V, E)$ be a graph to be recognized (input image). For every pair
of vertices $u, v \in V$, if there is an edge $e = (u, v)$ between them, we let
$R_{u,v}$ denote the attribute vector associated with edge $e$. The entries of each
such vector represent the set of relations $\mathcal{R}$ = {distance, relative orientation,
bearing, scale ratio} between $u$ and $v$, as shown in Figure 5. For a vertex
$u \in V$, we let $N(u)$ denote the set of vertices $v \in V$ such that the directed
edge $(u, v)$ corresponds to a sibling relation. For a relation $p \in \mathcal{R}$, we will use
$\mathcal{P}(u, p)$ to denote the distribution of values of relation $p$ between vertex $u$ and
all the vertices in the set $N(u)$, i.e., $\mathcal{P}(u, p)$ is a histogram encoding the $p$th
entry of the vectors $R_{u,v}$ for $v \in N(u)$.²

² The exception to this rule is the orientation relation. Rather than use absolute
orientation, measured with respect to a reference direction, we instead use the angle
from the previous edge in a clockwise ordering of edges emanating from a vertex.

Fig. 5. Forming the geometric signature.

Given two graphs $G = (V, E)$ and $G' = (V', E')$ with vertices $u \in V$ and
$u' \in V'$, we compute the similarity between $u$ and $u'$ in terms of their respective
distributions $\mathcal{P}(u, p)$ and $\mathcal{P}(u', p)$, for all $p \in \mathcal{R}$. Now, let $d_p(u, u')$ denote the
Earth Mover’s Distance (EMD) [37] between two such distributions $\mathcal{P}(u, p)$
and $\mathcal{P}(u', p)$, i.e., $d_p(u, u')$ denotes the minimum amount of work (defined
in terms of displacements of the masses associated with histograms $\mathcal{P}(u, p)$
and $\mathcal{P}(u', p)$) it takes to transform one distribution into the other. The main
advantage of using EMD to compute $d_p(u, u')$ lies in the fact that it subsumes
many histogram distances and permits partial matches in a natural way. This
important property allows the similarity measure to deal with the case where
the masses associated with distributions $\mathcal{P}(u, p)$ and $\mathcal{P}(u', p)$ are not equal.
Details of the method are presented in [19]. Given the values of $d_p(u, u')$
for all $p \in \mathcal{R}$, we arrive at a final node similarity function for vertices $u$ and
$u'$:

$$\sigma(u, u') = e^{-\sum_{p \in \mathcal{R}} d_p(u, u')}.$$
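A simplified sketch of the geometric node similarity is given below. It makes several assumptions that go beyond the text: each relation's histogram uses the same unit-width bins for both nodes, both histograms are non-empty and are normalized to unit mass (so the partial-matching property of the general EMD is not exercised), and the per-relation distances are summed in the exponent. In this one-dimensional, equal-mass case the EMD reduces to the sum of absolute differences of the cumulative histograms. Function and key names are hypothetical.

```python
import math

def emd_1d(hist_a, hist_b):
    """EMD between two non-empty histograms over the same unit-width bins,
    each normalized to total mass 1."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    work, carried = 0.0, 0.0
    for a, b in zip(hist_a, hist_b):
        carried += a / total_a - b / total_b   # mass still to be moved onward
        work += abs(carried)
    return work

def node_similarity(hists_u, hists_v):
    """exp(-sum of per-relation EMDs); both dicts are keyed by relation name."""
    return math.exp(-sum(emd_1d(hists_u[p], hists_v[p]) for p in hists_u))
```

Angular relations such as bearing are circular, which this linear-bin sketch ignores; a circular ground distance would be needed to treat them faithfully.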
4.3 Matching Algorithm


As mentioned in Section 1, our previous work addressed the problem of matching
rooted trees, and was unable to match directed acyclic graphs, unable to
accommodate geometric relations among nodes, and unable to preserve hierarchical
and sibling relations. Still, it serves as the starting point for our new
algorithm, and we review it accordingly. The method was a modified version of
Reyner’s algorithm [36,47] for finding the largest common subtree. The main
idea of the algorithm was to cast the structural matching problem as a set
of bipartite graph matching problems. A similarity matrix between the two
graphs’ nodes was constructed with each entry computing the pairwise similarity
between a particular node in the first tree and a node in the second tree.
This similarity measure was a weighted combination of the distance between
the two nodes’ TSVs, reflecting the extent to which their underlying subtrees
had similar structure, and the two nodes’ internal attributes.³

The algorithm is illustrated in Figure 6. The best pairwise node correspondence
obtained after a maximum cardinality maximum weight (MCMW) bipartite
matching is extracted and put into the solution set of correspondences.
In a greedy fashion, the algorithm recursively matches the two resulting pairs
of corresponding forests, at each step computing a maximum matching and
placing the best corresponding pair of nodes in the solution set. The key idea in
casting a graph-matching problem as a number of MCMW bipartite matching
problems is to use the topological signature vectors (TSV) to penalize nodes
with different underlying graph structure. This effectively allows us to discard
the graphs’ edge structure and formulate the problem as an attributed point
matching problem, with a node’s underlying structural context encoded as a
low-dimensional vector node attribute.

3 For shock graphs, each node encoded a set of medial axis, or shock, points and
their similarity was computed based on a Hausdorff distance between these point
sets.

Fig. 6. The DAG matching algorithm. (a) Given a query graph and a model graph,
(b) form a bipartite graph in which the edge weights are the pairwise node
similarities. Then, (c) compute a maximum matching and add the best edge to the
solution set. Finally, (d) split the graphs at the matched nodes and (e) recursively
descend.

To extend this framework to accommodate DAG matching, geometric relations,
and hierarchical/sibling constraint satisfaction, we begin by introducing
some definitions and notation. Let $\mathfrak{Q} = (V_{\mathfrak{Q}}, E_{\mathfrak{Q}})$ and
$\mathfrak{M} = (V_{\mathfrak{M}}, E_{\mathfrak{M}})$ be the two DAGs to be matched, with
$|V_{\mathfrak{Q}}| = n_{\mathfrak{Q}}$ and $|V_{\mathfrak{M}}| = n_{\mathfrak{M}}$. Define $d$ to be the
maximum degree of any vertex in $\mathfrak{Q}$ and $\mathfrak{M}$, i.e.,
$d = \max(\delta(\mathfrak{Q}), \delta(\mathfrak{M}))$. For each vertex $v$, let
$\chi(v) \in \mathbb{R}^d$ be the unique topological signature vector (TSV),
introduced in Section 4.1. The bipartite edge-weighted graph
$\mathfrak{G}(V_{\mathfrak{Q}}, V_{\mathfrak{M}}, E_{\mathfrak{G}})$ is represented as an
$n_{\mathfrak{Q}} \times n_{\mathfrak{M}}$ matrix $W$ whose $(q, m)$-th entry has the value:

\[ W_{q,m} = \alpha\,\sigma(q,m) + (1 - \alpha)\,(\|\chi(q) - \chi(m)\|), \tag{6} \]

where $\sigma(q, m)$ denotes the node similarity between nodes $q \in \mathfrak{Q}$ and
$m \in \mathfrak{M}$, and $\alpha$ is a convexity parameter that weights the relevance of
each term. Using the scaling algorithm of Gabow and Tarjan [15], we can efficiently
compute the maximum cardinality, maximum weight matching in $\mathfrak{G}$ with
complexity $O(|V||E|)$, resulting in a list of node correspondences between
$\mathfrak{Q}$ and $\mathfrak{M}$, called $\mathcal{L}$, that can be ranked in decreasing
order of similarity.

This set of node correspondences maximizes the sum of node similarities,
but does not enforce any hierarchical constraints other than the implicit ones
encoded in $\|\chi(q) - \chi(m)\|$. Thus, instead of using all the node
correspondences, we take a greedy approach and assume that only the first one is
correct, and remove the subgraphs rooted at the selected matched pair of nodes. We
now have two smaller problems of graph matching, one for the pair of removed
subgraphs, and another for the two remainders of the original graphs. Both
subproblems can, in turn, be solved by a recursive call of the above algorithm.
The complexity of such a recursive algorithm is $O(n^3)$.
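The greedy split-and-recurse scheme just described can be sketched as follows. This is only an illustration under several assumptions: graphs are stored as hypothetical child-list dictionaries, the weights are made up, and the MCMW matching is found by brute force over permutations rather than by the Gabow and Tarjan scaling algorithm, so it is usable only on very small graphs.

```python
from itertools import permutations

def reachable(children, root):
    """Return the set containing root and every node reachable from it."""
    seen, stack = set(), [root]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(children.get(v, []))
    return seen

def mcmw_matching(W, qs, ms):
    """Brute-force max cardinality, max weight matching (tiny inputs only)."""
    qs, ms = sorted(qs), sorted(ms)
    if len(qs) > len(ms):
        # Swap roles so that the smaller side is always assigned in full.
        flipped = {(m, q): w for (q, m), w in W.items()}
        return [(q, m) for m, q in mcmw_matching(flipped, ms, qs)]
    best = max(permutations(ms, len(qs)),
               key=lambda p: sum(W.get(pair, 0.0) for pair in zip(qs, p)))
    return list(zip(qs, best))

def greedy_match(W, q_children, m_children, qs, ms, solution=None):
    """Keep only the best pair of each matching, then recurse on both parts."""
    solution = [] if solution is None else solution
    if not qs or not ms:
        return solution
    q, m = max(mcmw_matching(W, qs, ms), key=lambda pair: W.get(pair, 0.0))
    solution.append((q, m))
    q_sub = (reachable(q_children, q) & qs) - {q}
    m_sub = (reachable(m_children, m) & ms) - {m}
    # One subproblem for the removed subgraphs, one for the remainders.
    greedy_match(W, q_children, m_children, q_sub, m_sub, solution)
    greedy_match(W, q_children, m_children,
                 qs - q_sub - {q}, ms - m_sub - {m}, solution)
    return solution

# Hypothetical three-node query and model trees with made-up similarities.
q_children = {"q1": ["q2", "q3"], "q2": [], "q3": []}
m_children = {"m1": ["m2", "m3"], "m2": [], "m3": []}
W = {("q1", "m1"): 0.9, ("q2", "m2"): 0.8, ("q3", "m3"): 0.7}
pairs = greedy_match(W, q_children, m_children, set(q_children), set(m_children))
```

In this toy case the roots are paired first, after which the recursion pairs the children inside the removed subgraphs. The sketch enforces none of the hierarchical or sibling constraints.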

It turns out that splitting subgraphs at nodes with high confidence of being a
good correspondence is not a strong enough constraint to guarantee that all
the hierarchical relations are satisfied. Consider, for example, the graphs in
Figure 7. After the first iteration of the matching algorithm, nodes $(q_5, m_5)$ will
be matched since their similarity is the highest one in $\mathcal{L}_1$. In the next
iteration, the subgraph rooted at $q_5$, $\mathfrak{Q}^*$, and the subgraph rooted at
$m_5$, $\mathfrak{M}^*$, as well as their corresponding complement graphs $\mathfrak{Q}^c$
and $\mathfrak{M}^c$, will be recursively evaluated. 4 When matching $\mathfrak{Q}^c$
against $\mathfrak{M}^c$, the best node correspondence according to the outlined
algorithm will be $(q_4, m_3)$. It is easy to see that this match violates the
hierarchical constraints among nodes because the siblings $q_4$ and $q_5$ are mapped
to $m_3$ and $m_5$, respectively, with $m_3$ a parent of $m_5$.

4 In the recursive call for $\mathfrak{Q}^*$ and $\mathfrak{M}^*$, nodes $q_5$ and $m_5$ will
be in the solution set, and so they will not be evaluated again.

Fig. 7. A case in which the hierarchical constraints between query nodes and model
nodes will be violated after two iterations of the algorithm. Note that only non-zero
$W_{q,m}$ values are shown.
Another  constraint  that  arises  in  several  domains  is  that  of  preserving  sibling 
relationships.  Note  that  this  constraint  is  not,  strictly  speaking,  a  hierarchical 
constraint,  since  there  are  no  hierarchical  dependencies  among  sibling  nodes. 
While  it  may  be  tempting  to  enforce  this  constraint  when  matching,  there  is  a 
possibility  that  a  sibling  relation  is  genuinely  broken  by  an  occluder.  In  such 
a  case,  we  may  not  want  to  enforce  this  constraint  or  else  we  will  be  unable 
to  find  meaningful  matching  subgraphs.  A  compromise  solution  would  be  to 
penalize  the  matches  that  break  a  sibling  relationship  so  as  to  favor  those  that 
provide  a  good  set  of  correspondences  while  maintaining  these  relationships 
intact. 

Figure 8 illustrates the problem. Assuming that $(q_5, m_4)$ has just been added
to the solution set, the next best correspondence is $(q_4, m_5)$, which violates a
sibling constraint. To avoid this, we can propagate the information provided
by the previous best match, $(q_5, m_4)$. This information is used to favor $q_4$’s
sibling, so that $(q_4, m_3)$ can be chosen instead. Since we do not want to become
too sensitive to noise in the graph, we shall consider preserving the
sibling-or-sibling-descendant relationships instead of the stricter sibling relationship.
We will refer to this asymmetric relation between nodes as the SSD relation. 5
Note that due to the asymmetry of the relation, the desired propagation of
information will occur only when the algorithm proceeds in a top-down
fashion. In the next section, we will see how to promote a top-down node
matching.

Fig. 8. Preserving sibling relationships by propagating the information from the
previous best match. The matching pair $(q_4, m_3)$ is chosen over the slightly better
match $(q_4, m_5)$, because it results in two siblings in the query being matched to two
siblings in the model.

Before continuing, let us define the rather intuitive node relationships that we
will be working with. Let $\mathfrak{G}(V, E)$ be a DAG and let $u, v$ be two nodes in $V$.
We say that $u$ is a parent of $v$ if there is an edge from $u$ to $v$. Furthermore,
$u$ is an ancestor of $v$ if and only if there is a path from $u$ to $v$. Similarly,
$u$ is an SSD of $v$ if and only if there exists a parent of $v$ that is also an
ancestor of $u$.

5 Note that while the sibling relationship is symmetric, the SSD relationship is not,
i.e., if $u$ is the “nephew” of $v$, then SSD($u$,$v$) is true, but SSD($v$,$u$) is false.
The relations defined above will allow us to express the desired constraints.
However, we first need to determine how to make this information explicit,
for it is not immediately available from the adjacency matrices of the graphs.
A simple method is to compute the transitive closure graphs of our graphs.
The transitive closure of a directed graph $\mathfrak{G} = (V, E)$ is a graph
$\mathfrak{H} = (V, F)$, with $(v, w)$ in $F$ if and only if there is a path from $v$ to
$w$ in $\mathfrak{G}$. The transitive closure can be computed in $O(|V||E|)$ time [16].

From the above definition, it is easy to see that the transitive closure of a
graph is nothing else than the ancestor relation. Computing the SSD relation,
on the contrary, requires a bit of extra work. Let $A_{\mathfrak{G}}$ be the adjacency
matrix of the DAG $\mathfrak{G}(V, E)$, and let $T_{\mathfrak{G}}$ be the adjacency matrix
of the transitive closure graph of $\mathfrak{G}$. By means of these two matrices, we can
now compute the non-symmetric SSD relation by defining $S_{\mathfrak{G}}$ as the
$|V| \times |V|$ matrix, where

\[ S_{\mathfrak{G}}(u,v) = \begin{cases} 1 & \text{if } \exists\, w \in V \,\{A_{\mathfrak{G}}(w,v) = 1 \;\&\; T_{\mathfrak{G}}(w,u) = 1\}, \\ 0 & \text{otherwise.} \end{cases} \tag{7} \]
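To make Equation 7 concrete, the following sketch computes the transitive closure with Warshall's algorithm (a swapped-in textbook method, not necessarily the one used in [16]) and then the SSD matrix. It reads the relation as in footnote 5: $u$ is an SSD of $v$ when some parent of $v$ is an ancestor of $u$. Matrices are plain nested lists, and the function names and the four-node DAG are hypothetical.

```python
def transitive_closure(A):
    """Warshall's algorithm: T[u][v] = 1 iff a path of length >= 1 runs u -> v."""
    n = len(A)
    T = [row[:] for row in A]
    for k in range(n):
        for i in range(n):
            if T[i][k]:
                for j in range(n):
                    if T[k][j]:
                        T[i][j] = 1
    return T

def ssd_matrix(A, T):
    """S[u][v] = 1 iff some w satisfies A[w][v] = 1 and T[w][u] = 1."""
    n = len(A)
    return [[int(any(A[w][v] and T[w][u] for w in range(n)))
             for v in range(n)] for u in range(n)]

# Hypothetical DAG: 0 -> 1, 0 -> 2, 1 -> 3, so node 3 is the "nephew" of node 2.
A = [[0, 1, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
T = transitive_closure(A)
S = ssd_matrix(A, T)
```

In this example the siblings 1 and 2 are SSDs of each other, while the nephew 3 is an SSD of its uncle 2 but not the other way around, which is the asymmetry noted in footnote 5.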


Armed with our new matrices $T_{\mathfrak{Q}}$, $T_{\mathfrak{M}}$, $S_{\mathfrak{Q}}$, and
$S_{\mathfrak{M}}$, we can update the similarity matrix, $W$, at each iteration of the
algorithm, so as to preserve the ancestor relations and to discourage breaking SSD
relations. At the first iteration, $n = 0$, we start with $W^0 = W$. Next, let
$(q', m')$ be the best node correspondence selected at the $n$-th iteration of the
algorithm, for $n \geq 0$. The new weights for each entry $W^{n+1}_{q,m}$ of the
similarity matrix, which will be used as edge weights in the bipartite graph at
iteration $n + 1$, are updated according to:

\[ W^{n+1}_{q,m} = \begin{cases} 0 & \text{if } T_{\mathfrak{Q}}(q, q') \neq T_{\mathfrak{M}}(m, m'), \\ \beta\, W^{n}_{q,m} & \text{else if } S_{\mathfrak{Q}}(q, q') \neq S_{\mathfrak{M}}(m, m'), \\ W^{n}_{q,m} & \text{otherwise;} \end{cases} \tag{8} \]

where $0 < \beta < 1$ is a term that penalizes a pair of sibling nodes in the query
being matched to a pair of non-sibling nodes in the model. It is sufficient to
apply a small penalty to these cases, since the goal is simply to favor siblings
over non-siblings when the similarities of the others are comparable to that of
the siblings.
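The update rule of Equation 8 can be sketched as a direct loop over all entries. The code below is a naive illustration, not the efficient data-structure-based update; the function name, the value of beta, and the example matrices (a three-node query with edges 0 -> 1 and 0 -> 2, a four-node model with edges 0 -> 1, 0 -> 2 and 3 -> 2) are assumptions made for the example.

```python
def update_weights(W, T_q, T_m, S_q, S_m, q_best, m_best, beta=0.9):
    """Return W^{n+1} given W^n and the pair (q_best, m_best) just selected."""
    W_next = [row[:] for row in W]
    for q in range(len(W)):
        for m in range(len(W[0])):
            if T_q[q][q_best] != T_m[m][m_best]:
                W_next[q][m] = 0.0               # ancestor relation broken
            elif S_q[q][q_best] != S_m[m][m_best]:
                W_next[q][m] = beta * W[q][m]    # SSD relation broken
    return W_next

# Transitive closure and SSD matrices of the hypothetical query and model.
T_q = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
S_q = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
T_m = [[0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]]
S_m = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]

W = [[0.5] * 4 for _ in range(3)]
W_next = update_weights(W, T_q, T_m, S_q, S_m, q_best=1, m_best=1)
```

After query node 1 is matched to model node 1, pairing its sibling (query node 2) with model node 3, which does not share a parent with model node 1, is only penalized by beta, whereas pairs that disagree on the ancestor relation are zeroed out.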

It is clear that when $q'$ and $m'$ are the roots of the subgraphs to match,
the ancestor and SSD relations will be true for all the nodes in the DAG.
Thus, in practice, when matching the $q'$-rooted and $m'$-rooted DAGs, we can
avoid evaluating the conditions above. In addition, we know that only a few
weights will change as the result of a new node correspondence, and so we only
need to update those entries of the matrix. This can be done efficiently by
designing a data structure that simplifies the access to the weights that are to
be updated. Alternatively, the update step can also be efficiently implemented
with matrices by noticing that the column of $A_{\mathfrak{G}}$ corresponding to node $u$
tells us all the parents of $u$, while the row of $T_{\mathfrak{G}}$ corresponding to node
$v$ gives us all the descendants of $v$. Thus, given a node pair $(q', m')$ and their
corresponding $A_{\mathfrak{Q}}$, $T_{\mathfrak{Q}}$, $A_{\mathfrak{M}}$, and $T_{\mathfrak{M}}$, it is
straightforward to select and modify only those entries of $W^{n}_{q,m}$ that need to
be updated at each iteration of the algorithm.

A careful look at the algorithm as it has been stated so far will reveal that,
in general, the first node correspondences found will be those among lower-level
nodes in the hierarchy. We can expect this bottom-up behavior of the
algorithm because the lower-level nodes carry less structural information and
so their weight will be less affected by the structural difference of the graphs
rooted at them. Therefore, nodes at the bottom of the hierarchy will tend to
have high similarity values and consequently, they will be chosen to split the
graphs, creating small DAGs with few constraints on the nodes.

A solution to this problem is to redefine the way we choose the best edge from
the bipartite matching. Instead of simply choosing the edge with greatest
weight, we will also consider the order (see footnote 6) of the DAG rooted at the
matched nodes to select the pair of nodes that have a large similarity weight and are
also roots of large subgraphs. We define the mass, $m(v)$, of node $v$ as the order,
$n(T)$, of the DAG rooted at $v$. For a given graph $\mathfrak{G}(V, E)$, the
$|V|$-dimensional mass vector, $M_{\mathfrak{G}}$, in which each of its dimensions is the
mass $m(v)$ of a distinct $v \in V$, can be computed from the transitive closure
matrix, $T_{\mathfrak{G}}$, of the graph by $M_{\mathfrak{G}} = T_{\mathfrak{G}} \times \mathbf{1}$,
where $\mathbf{1}$ is the $|V|$-dimensional vector whose elements are all equal to 1.
Thus, $M_{\mathfrak{G}}$ is a vector in which each element $M_{\mathfrak{G}}(v)$, for
$v \in V$, is the number of nodes in the DAG rooted at $v$.

Unfortunately, the mass does not give us enough information about the depth
of the subgraph rooted at a node since, for example, the path of $n$ nodes
has the same mass as the star of $n$ nodes. A better idea is to consider the
cumulative mass, $\hat{m}$. Let $\hat{m}(v)$ be defined as the sum of all the masses of
the nodes of the DAG rooted at $v$. Thus, the cumulative mass vector will be given
by $\hat{M}_{\mathfrak{G}} = T_{\mathfrak{G}} \times M_{\mathfrak{G}}$, which can also be written
as $\hat{M}_{\mathfrak{G}} = T_{\mathfrak{G}}^{2} \times \mathbf{1}$. This vector
can then be used to obtain a relative measure of how tall and wide the rooted
subgraphs are with respect to the graph they belong to, by simply normalizing
the masses. Let $\tilde{M}_{\mathfrak{G}}$ be the normalized cumulative mass vector given by

\[ \tilde{M}_{\mathfrak{G}} = \frac{\hat{M}_{\mathfrak{G}}}{\max_{v \in V} |\hat{M}_{\mathfrak{G}}(v)|}, \tag{9} \]

where the normalizing factor will correspond to the cumulative mass of the
node whose in-degree is zero (a root) and has the greatest cumulative mass
in $\mathfrak{G}$.

6 Here we follow the convention in the Graph Theory literature that considers the
order of a graph to be the number of nodes in the graph, and the size of the graph
to be the number of edges in the graph.


The cumulative mass is exactly the piece of information we need, since it
should be easy to see that for all the trees with $n$ nodes, the star is the one
with smallest cumulative mass, while the path is the one with the greatest.
Hence, the cumulative mass, $\hat{m}$, for the root of a tree of order $n$ satisfies
$2n - 1 \leq \hat{m} \leq \frac{1}{2} n(n + 1)$. This measure is a good indicator of how
deep and wide a subtree is, and so provides a means to find a compromise between the
node similarities and their positions in the graph.
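The mass and cumulative mass vectors, and the two extreme cases of the bound, can be checked with a short sketch. It assumes the closure matrix includes the diagonal, so that each node counts itself in the DAG rooted at it, which is what the bounds $2n - 1$ and $\frac{1}{2}n(n+1)$ require. All function names here are hypothetical.

```python
def reflexive_closure(A):
    """Warshall closure with the diagonal set, so each node reaches itself."""
    n = len(A)
    T = [[1 if i == j else A[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if T[i][k]:
                for j in range(n):
                    if T[k][j]:
                        T[i][j] = 1
    return T

def mat_vec(T, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in T]

def masses(A):
    """Return (mass, cumulative mass, normalized cumulative mass) per node."""
    T = reflexive_closure(A)
    M = mat_vec(T, [1] * len(A))      # M = T x 1
    M_hat = mat_vec(T, M)             # M_hat = T x M = T^2 x 1
    top = max(M_hat)
    return M, M_hat, [x / top for x in M_hat]

def star(n):
    """Adjacency matrix of a root (node 0) with n - 1 leaf children."""
    return [[1 if i == 0 and j > 0 else 0 for j in range(n)] for i in range(n)]

def path(n):
    """Adjacency matrix of the chain 0 -> 1 -> ... -> n - 1."""
    return [[1 if j == i + 1 else 0 for j in range(n)] for i in range(n)]

n = 5
M_star, Mhat_star, Mnorm_star = masses(star(n))
M_path, Mhat_path, Mnorm_path = masses(path(n))
```

For $n = 5$ both roots have mass 5, but the cumulative mass of the star root is $2n - 1 = 9$ while that of the path root is $\frac{1}{2}n(n+1) = 15$, so the cumulative mass separates shallow from deep structures where the plain mass cannot.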

We can then promote a top-down behavior in the algorithm by selecting the
match $(q, m)^+$ from the list, $\mathcal{L}$, returned by each MCMW bipartite matching,
with the maximum convex sum of the similarity and the relative mass of the
matched nodes,

\[ (q, m)^+ = \operatorname{argmax}_{(q,m) \in \mathcal{L}} \left\{ \gamma\, W_{q,m} + (1 - \gamma) \max(\tilde{M}_{\mathfrak{Q}}(q), \tilde{M}_{\mathfrak{M}}(m)) \right\}, \tag{10} \]

where $0 \leq \gamma \leq 1$ is a real value that controls the influence of the relative
cumulative mass in selecting the best match. Since we want to promote a top-down
association of nodes without distorting the actual node similarities, we
suggest $\gamma$ to be in the interval [0.7, 0.9]. In Figure 9, we compare the sequences
of graph splits using different values for $\gamma$. When $\gamma = 1$, we obtain the
original equation in [39] that, as can be seen in the figure, tends to produce a
bottom-up behavior of the algorithm.

Fig. 9. An example in which $\gamma < 1$ can promote a top-down behavior in the
algorithm. The cumulative mass of each node is shown in blue. Edge weights are
computed according to Equation 10, for $\gamma = 1$ and $\gamma = 0.7$. For the given set of
node similarities and $\gamma = 1$, the best node correspondence at this iteration is the
pair of leaves (q^rru), whereas for $\gamma = 0.7$, the best node correspondence is the
pair of non-terminal nodes $(q_4, m_2)$.
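The selection rule of Equation 10 can be illustrated with a few lines of code. The sketch assumes the matching is given as a list of index pairs and that the normalized cumulative masses have already been computed; the function name and all numbers are made up for illustration.

```python
def select_best_pair(matching, W, mass_q, mass_m, gamma=0.8):
    """Pick the pair maximizing gamma * W + (1 - gamma) * max(relative masses)."""
    def score(pair):
        q, m = pair
        return gamma * W[q][m] + (1.0 - gamma) * max(mass_q[q], mass_m[m])
    return max(matching, key=score)

# Hypothetical matching: pair (0, 0) joins the two roots, pair (1, 1) two leaves.
matching = [(0, 0), (1, 1)]
W = [[0.80, 0.00],
     [0.00, 0.90]]
mass_q = [1.0, 0.20]   # normalized cumulative masses of the query nodes
mass_m = [1.0, 0.25]   # normalized cumulative masses of the model nodes

bottom_up = select_best_pair(matching, W, mass_q, mass_m, gamma=1.0)
top_down = select_best_pair(matching, W, mass_q, mass_m, gamma=0.7)
```

With $\gamma = 1$ the slightly more similar leaf pair wins, while with $\gamma = 0.7$ the root pair scores 0.86 against 0.705 and is selected first.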


Given the set of node correspondences between two graphs, the final step is
to compute an overall measure of graph similarity. The similarity of the query
graph to the model graph is given by

\[ \sigma_{\Phi}(\mathfrak{Q}, \mathfrak{M}) = \frac{(n_{\mathfrak{Q}} + n_{\mathfrak{M}}) \sum_{(q,m)^+ \in \Phi} W_{q,m}}{2\, n_{\mathfrak{Q}}\, n_{\mathfrak{M}}}, \tag{11} \]

where $n_{\mathfrak{Q}}$ and $n_{\mathfrak{M}}$ are the orders of the query graph and the model
graph, respectively.

The graph similarity is given by a weighted average of the number of matched
nodes in the query and in the model, where the weights are given by the node
similarity of each matched node pair. If all the query nodes are matched with
similarity 1, i.e., their attributes are identical, we have
$\sum_{(q,m)^+ \in \Phi} W_{q,m} = n_{\mathfrak{Q}}$, and so
$\sigma_{\Phi}(\mathfrak{Q}, \mathfrak{M}) = \frac{1}{2}\left(\frac{n_{\mathfrak{Q}}}{n_{\mathfrak{M}}} + 1\right)$.
Since all query nodes have been matched, we know that $n_{\mathfrak{M}} \geq n_{\mathfrak{Q}}$,
and so $\sigma_{\Phi}(\mathfrak{Q}, \mathfrak{M})$ will be one when all the model nodes
are mapped, and less than one otherwise. Therefore, the graph similarity
increases with the quality of each pair of node correspondences, and decreases
with the number of unmatched nodes, both in the query and in the
model. Hence, a model that contains the query as a relatively small subgraph
is not as good a match as a model for which most of the nodes match those of the
query graph, and vice versa.
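As a quick numerical check of Equation 11 and of the special case discussed above, the sketch below evaluates the graph similarity from a list of matched-pair weights. The function name and argument layout are hypothetical.

```python
def graph_similarity(matched_weights, n_q, n_m):
    """Graph similarity from the weights of the matched pairs and the graph orders."""
    if n_q <= 0 or n_m <= 0:
        raise ValueError("graph orders must be positive")
    return (n_q + n_m) * sum(matched_weights) / (2.0 * n_q * n_m)

# Four query nodes, all matched with similarity 1.
same_size = graph_similarity([1.0] * 4, n_q=4, n_m=4)
larger_model = graph_similarity([1.0] * 4, n_q=4, n_m=8)
```

When the model has as many nodes as the query the similarity is 1; when the model is twice as large it drops to $\frac{1}{2}(4/8 + 1) = 0.75$, reflecting the unmatched model nodes.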

Finally,  it  should  be  noted  that  the  relative  weighting  of  the  topological  and 
geometric  terms  in  the  bipartite  graph  edge  weights  need  not  be  constant  for  all 
edges.  Since  each  edge  spans  an  image  node  and  a  model  node,  the  model  can 
be  used  to  define  an  a  priori  weighting  scheme  that  is  edge  dependent.  Thus,  if 
portions  of  the  model  were  more  geometrically  constrained  (e.g.,  articulation 
was  prohibited),  those  model  nodes  could  have  a  higher  weighting  on  their 
geometric  similarity  component.  Similarly,  portions  of  the  model  that  were 
less  constrained  could  have  a  higher  weighting  on  the  topological  similarity 
component.  This  is  a  very  powerful  feature  of  the  algorithm,  allowing  the 
incorporation  of  local  model  constraints  into  the  matching  algorithm. 

The final algorithm is shown in Figure 10. The first step of the algorithm is to
compute a node similarity matrix, the transitive closure matrices, the sibling
matrices, and the node TSVs for both graphs. Assuming a linear algorithm for
the pairwise node similarities, the former matrix can be computed in $O(n^3)$.
The other matrices can, in turn, be obtained in linear time and in quadratic
time, respectively. At each iteration of the algorithm, we have to compute an
MCMW bipartite matching, sort its output, and update the similarity matrix.
The complexity at each step is then determined by that of the bipartite
matching algorithm, $O(|V||E|)$, since it is the most complex operation of the
three. The number of iterations is bounded by $\min(n_{\mathfrak{Q}}, n_{\mathfrak{M}})$, and
so the overall complexity of the algorithm is $O(n^3)$. Hence, we have provided the
algorithm with important properties for the matching process while maintaining its
original complexity. An example of the blob correspondence computed over two
hand exemplars is shown in Figure 11.

procedure isomorphism($\mathfrak{Q}$, $\mathfrak{M}$)
    $\Phi(\mathfrak{Q}, \mathfrak{M}) \leftarrow \emptyset$  ; solution set
    compute the $n_{\mathfrak{Q}} \times n_{\mathfrak{M}}$ weight matrix $W^0$ from Eq. 6
    $T_{\mathfrak{Q}}$ = compute transitive closure matrix from $A_{\mathfrak{Q}}$
    $T_{\mathfrak{M}}$ = compute transitive closure matrix from $A_{\mathfrak{M}}$
    $S_{\mathfrak{Q}}$ = compute SSD matrix from $A_{\mathfrak{Q}}$ and $T_{\mathfrak{Q}}$
    $S_{\mathfrak{M}}$ = compute SSD matrix from $A_{\mathfrak{M}}$ and $T_{\mathfrak{M}}$
    compute the TSV of each nonterminal node in $\mathfrak{Q}$ and $\mathfrak{M}$
    unmark all nodes in $\mathfrak{Q}$ and in $\mathfrak{M}$
    call match(root($\mathfrak{Q}$), root($\mathfrak{M}$))
    return($\sigma_{\Phi}(\mathfrak{Q}, \mathfrak{M})$)
end

procedure match($u$, $v$)
    do
    {
        let $\mathfrak{Q}_u \leftarrow$ $u$ rooted unmarked subgraph of $\mathfrak{Q}$
        let $\mathfrak{M}_v \leftarrow$ $v$ rooted unmarked subgraph of $\mathfrak{M}$
        $\mathcal{L} \leftarrow$ max cardinality, max weight bipartite matching between
            unmarked nodes in $\mathfrak{G}(V_{\mathfrak{Q}_u}, V_{\mathfrak{M}_v})$ with weights from $W^{n+1}$ (see [15])
        $(u', v') \leftarrow$ choose max weight pair in $\mathcal{L}$ from Eq. 10
        $\Phi(\mathfrak{Q}, \mathfrak{M}) \leftarrow \Phi(\mathfrak{Q}, \mathfrak{M}) \cup \{(u', v')\}$
        update the similarity matrix $W^{n+1}$ according to Eq. 8
        mark $u'$
        mark $v'$
        call match($u'$, $v'$)
        call match($u$, $v$)
    }
    while ($\mathfrak{Q}_u \neq \emptyset$ and $\mathfrak{M}_v \neq \emptyset$)

Fig. 10. Algorithm for Matching Two Directed Acyclic Graphs
Fig. 11. Example Correspondence Computed between Two Blob Graphs

5  Experiments 

We evaluate our framework on the domain of view-based 3-D object recognition,
where the objective is to choose the right object (identification) for a
particular query view and also to determine its correct pose (pose estimation).
To provide a comprehensive evaluation, we used two popular image libraries:
the Columbia University COIL-20 (20 objects, 72 views per object) [30] and
the ETH Zurich ETH-80 (8 categories, 10 exemplars per category, 41 views
per exemplar) [22]. Sample views of objects from these two libraries are shown
in Figure 12. Note that in the ETH-80 library, some categories have very similar
shape, differing only in their appearance. Thus, the horse, dog, and cow
categories were collapsed to form a 4-legged animal category, the apple and
tomato categories were collapsed to form a spherical fruit category, and the
two car instance categories were collapsed to form a single car category.

Fig. 12. Views of sample objects from the Columbia University Image Library
(COIL-20) and the ETH Zurich (ETH-80) Image Set.

We applied the following procedure to each database to evaluate the proposed
framework. We initially removed the first entry from the database, used it as
a query, and computed its similarity with each of the remaining views in the
database. We then returned the query to the database and repeated the
same process for each of the other database entries. This process results in
an $n \times n$ similarity matrix, where the entry $(i,j)$ indicates how similar views $i$
and $j$ are. For a particular query, we classify its identification as correct if the
maximum similarity is obtained with a view which belongs to the same object
as the query. Likewise, pose estimation is correct if view $i$ of object $j$, $v_{i,j}$,
matches most closely with $v_{n,j}$, where $n$ is one of $i$’s neighboring views.
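The identification part of this leave-one-out protocol can be sketched directly from a precomputed similarity matrix. The code below is an illustration only; the function name, the labels, and the 4 x 4 similarity matrix are made up, and it does not reproduce the pose-estimation criterion.

```python
def leave_one_out_rate(sim, labels):
    """Fraction of views whose most similar other view has the same label."""
    n = len(sim)
    correct = 0
    for i in range(n):
        # Exclude the query itself, then take its most similar remaining view.
        best = max((j for j in range(n) if j != i), key=lambda j: sim[i][j])
        correct += labels[best] == labels[i]
    return correct / n

# Hypothetical similarity matrix for four views of two objects.
labels = ["cup", "cup", "car", "car"]
sim = [[1.0, 0.9, 0.2, 0.1],
       [0.9, 1.0, 0.5, 0.2],
       [0.2, 0.5, 1.0, 0.4],
       [0.1, 0.2, 0.4, 1.0]]
rate = leave_one_out_rate(sim, labels)
```

Here the third view is most similar to a view of the other object, so three of the four queries are identified correctly.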

Our overall recognition rates for the COIL-20 and ETH-80 datasets are 93.5% and
97.1%, respectively. We show a part of the matching results in Table 1. Upon
investigation as to why the COIL-20 dataset yields poorer performance, we
found that most of the mismatches were between three different car objects:
column three of the first row, column one of the second row, and column four
of the fourth row, as shown in the left of Figure 12. Despite being different
objects with different appearance, their coarse shape structure is similar and
their blob graphs are indeed similar. If we group these three exemplars into
the same category and count these matches as correct, our recognition rate
rises to 96.5%. Our recognition framework is clearly suited to coarse shape
categorization as opposed to exemplar matching.

Table 1 (columns: Query; Top 9 Matched Objects)
Top matched models are sorted by the similarity to the query.

For pose estimation, we observe that in all but 9.8% and 14.6% of the COIL-20
and ETH-80 experiments, respectively, the closest match selected by our
algorithm was a neighboring view. Note that if the closest matching view was
not an immediate neighbor drawn from the same exemplar, the match was
deemed incorrect, despite the fact that the matching view might be a
neighboring view of a different exemplar from the same category. This is perhaps
overly harsh, as reflected by the 14.6% result, but view alignment between
exemplars belonging to the same category was not provided. These results can
be considered worst case for several additional reasons. Given the high
similarity among neighboring views, it could be argued that our pose estimation
criterion is too strict, and that perhaps a measure of “viewpoint distance”,
i.e., “how many views away was the closest match”, would be less severe. In
any case, we anticipate that with fewer samples per object, neighboring views
would be more dissimilar, and our matching results would improve. More
importantly, many of the objects are rotationally symmetric, and if a query has
an identical view elsewhere in the dataset, that view might be chosen (with
equal similarity) and scored as an error.


To demonstrate the framework’s robustness, we performed five perturbation
experiments on both datasets. The experiments were identical to the
experiments above, except that we randomly chose a node, $v$, in the query graph
and, if the number of nodes in the directed acyclic subgraph rooted at $v$ was less
than 10% of the number of nodes in the original graph, we deleted the rooted
subgraph from the query. We then repeated the same process for maximum
ratios of 20%, 30%, 40%, and 50%. The results are shown in Table 2, and
reveal that the error rate increases gracefully as a function of increased
perturbation. Although not a true occlusion experiment, which would require
that we replace the removed subgraph with an occluder subgraph, these
results demonstrate the framework’s ability to match local structure, a property
essential for handling occlusion.
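The perturbation step can be sketched as follows, assuming the query graph is stored as a hypothetical child-list dictionary. The text does not say what happens when the chosen subgraph is too large; this sketch simply leaves the query unchanged in that case, which is an assumption.

```python
import random

def rooted_subgraph(children, root):
    """Return the set of nodes in the subgraph rooted at root."""
    seen, stack = set(), [root]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(children[v])
    return seen

def delete_if_small(children, v, max_ratio):
    """Remove the subgraph rooted at v if it holds < max_ratio of all nodes."""
    sub = rooted_subgraph(children, v)
    if len(sub) >= max_ratio * len(children):
        return {u: list(kids) for u, kids in children.items()}  # unchanged copy
    return {u: [c for c in kids if c not in sub]
            for u, kids in children.items() if u not in sub}

def perturb(children, max_ratio, rng):
    """Apply delete_if_small to a node chosen uniformly at random."""
    return delete_if_small(children, rng.choice(sorted(children)), max_ratio)

# Hypothetical ten-node query graph.
query = {0: [1, 2, 3], 1: [4, 5], 2: [6], 3: [7, 8, 9],
         4: [], 5: [], 6: [], 7: [], 8: [], 9: []}
small = delete_if_small(query, 6, 0.2)    # one node out of ten: deleted
kept = delete_if_small(query, 1, 0.2)     # three nodes out of ten: kept
trial = perturb(query, 0.2, random.Random(0))
```

Raising `max_ratio` toward 0.5 allows progressively larger subgraphs to be removed, mirroring the 10% to 50% settings of the experiment.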
Perturbation                  10%     20%     30%     40%     50%
Recognition rate (COIL-20)    91.2%   89.5%   87.3%   83.7%   78.6%
Recognition rate (ETH-80)     94.2%   91.5%   89.3%   84.7%   82.6%

Table 2
Recognition rate as a function of increasing perturbation in the form of missing data.
Percentages indicate how much of the query graph was removed prior to matching.


6  Limitations 


Both the representation and matching components of our integrated framework
have limitations. Since it is based on image gradients, the blob and ridge
decomposition does not perform well in the presence of textured surfaces, and
spurious and missing blobs may result. Although the matching algorithm can
accommodate both noise and occlusion, it does rely on there being a sufficient
number of one-to-one correspondences to discriminate the correct model from
other models. If blobs are highly over- or under-segmented, matching may fail
as too few one-to-one correspondences may exist.


7  Conclusions 

Matching two images whose similarity exists at the coarse shape level is critical
to object categorization. Blobs and ridges provide an ideal multiscale part
vocabulary for coarse shape modeling which, when combined with an array
of geometric relations in the form of a graph, yields a powerful, hierarchical
characterization of an object’s coarse shape. Our inexact graph matching
framework exploits both the topological and the geometric relations in a
directed acyclic graph to yield an efficient algorithm for coarse-to-fine shape
matching. We have demonstrated the generality of the framework by applying it to
three different domains (without domain-specific tuning), with very encouraging
results in each domain.